Robust, Lexicalized Native Language Identification

نویسندگان

  • Julian Brooke
  • Graeme Hirst
چکیده

Previous approaches to the task of native language identification (Koppel et al., 2005) have been limited to small, within-corpus evaluations. Because these are restrictive and unreliable, we apply cross-corpus evaluation to the task. We demonstrate the efficacy of lexical features, which had previously been avoided due to the within-corpus topic confounds, and provide a detailed evaluation of various options, including a simple bias adaptation technique and a number of classifier algorithms. Using a new web corpus as a training set, we reach high classification accuracy for a 7-language task, performance which is robust across two independent test sets. Although we show that even higher accuracy is possible using crossvalidation, we present strong evidence calling into question the validity of cross-validation evaluation using the standard dataset.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Exploring Syntactic Features for Native Language Identification: A Variationist Perspective on Feature Encoding and Ensemble Optimization

In this paper, we systematically explore lexicalized and non-lexicalized local syntactic features for the task of Native Language Identification (NLI). We investigate different types of feature representations in singleand cross-corpus settings, including two representations inspired by a variationist perspective on the choices made in the linguistic system. To combine the different models, we ...

متن کامل

Offline Language-free Writer Identification based on Speeded-up Robust Features

This article proposes offline language-free writer identification based on speeded-up robust features (SURF), goes through training, enrollment, and identification stages. In all stages, an isotropic Box filter is first used to segment the handwritten text image into word regions (WRs). Then, the SURF descriptors (SUDs) of word region and the corresponding scales and orientations (SOs) are extr...

متن کامل

Data Driven Language Transfer Hypotheses

Language transfer, the preferential second language behavior caused by similarities to the speaker’s native language, requires considerable expertise to be detected by humans alone. Our goal in this work is to replace expert intervention by data-driven methods wherever possible. We define a computational methodology that produces a concise list of lexicalized syntactic patterns that are control...

متن کامل

Finding Target Language Correspondence for Lexicalized EBMT System

This paper presents a three-phase approach to find the correspondence in Target Language (TL) sentence for a fragment of Source Language (SL) sentence in a lexicalized EBMT system. To be practical, it exploits surface information as much as possible instead of using parsers. Experiments show that, although not so perfect, it is very robust and effective. The three phases are: First, align the s...

متن کامل

L1 Glossing and Lexical Inferencing: Evaluation of the Overarching Issue of L1 Lexicalization

This empirical study reports on a cross-linguistic analysis of the overarching issue of L1 lexicalization regarding two (non)-interventionist approaches to vocabulary teaching. Participants were seventy four juniors at the Islamic Azad University, Roudehen Branch in Tehran. The investigation pursued (i) the impact of the provided (non)-interventionist treatments on both sets of (non)-lexicalize...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2012